Summary. When collecting geocoded confidential data with the intent to disseminate,
agencies often resort to altering the geographies prior to making data publicly available 
due to data privacy obligations. An alternative to releasing aggregated and/or perturbed 
data is to release multiply-imputed synthetic data, where sensitive values are replaced with 
draws from statistical models designed to capture important distributional features in the 
collected data. One issue that has received relatively little attention, however, is how to 
handle spatially outlying observations in the collected data, as common spatial models often 
have a tendency to overfit these observations. The goal of this work is to bring this issue
to the forefront and propose a solution, which we refer to as “differential smoothing.” After 
implementing our method on simulated data, highlighting the effectiveness of our approach 
under various scenarios, we illustrate the framework using data consisting of sale prices of 
homes in San Francisco.

1 Introduction


When collecting confidential data with the intent to disseminate, there is often both an 
ethical as well as legal obligation for agencies to protect the privacy of data subjects’ identities 
and sensitive attributes. This charge can be particularly challenging for agencies who seek to 
include fine levels of geography (e.g., latitude/longitude) in the public use files they provide. 
While data users can benefit greatly from this detailed spatial information, this can also 
enable ill-intentioned users to identify individuals in the dataset. This disclosure risk can be 
especially high in regions where individuals with sensitive attributes may be more unique. 

As a result, agencies often resort to altering (or worse, suppressing) the geographies 
and/or sensitive attributes before making data publicly available. A common technique is 
to aggregate data from the individual level to areal units (e.g., Census tracts or counties).
Not only can this destroy the ability to estimate the spatial structure at finer geographies
than the aggregate level, but it may also lead researchers to commit ecological fallacies
(Freedman, 2004; Lawson et al., 2012; Bradley et al., 2015). Agencies may also randomly move
each record’s observed location to another location, e.g., within some radius r of the true
location. In addition to having a negative impact on the spatial structure in the released
data (e.g., Armstrong et al., 1999; VanWey et al., 2005), the effect of this perturbation may
be overlooked by researchers, potentially resulting in false conclusions.

An alternative to releasing aggregated and/or perturbed data is to release multiply-imputed
synthetic data, where sensitive values are replaced with draws from statistical models
designed to capture important distributional features in the collected data. In some cases,
agencies may generate fully synthetic data (Rubin, 1993; Reiter, 2002, 2005; Raghunathan
et al., 2003; Quick et al., 2014), in which the released datasets are composed entirely of
simulated records. We, however, take a partially synthetic approach in which only a collection
of values/variables are replaced with imputed values (Little, 1993; Kennickell, 1997; Abowd
and Woodcock, 2004; Reiter, 2003, 2004; An and Little, 2007; Toth, 2014). Specifically, we
assume the data consist of exact geographic locations and covariate information for each
individual, as well as a continuously varying response which will be multiply imputed.

One issue that has yet to be adequately addressed, however, is how to handle spatially 
outlying observations in the collected data. For instance, suppose the agency would like to 
release annual income data for individuals from a number of subpopulations for a given city. 
Further, suppose a Census tract contains only one black female over 50 years of age. Were 
the agency to release aggregate data, it is likely that this Census tract’s income information 
would be suppressed for this particular subpopulation in order to protect this individual’s 
privacy. When generating (fully or partially) synthetic data, however, such steps to protect 
the individual’s privacy may not even be considered, much less taken. Furthermore, such a
crude method ignores the size of a given areal unit: e.g., the sole individual in an
urban Census tract (where tracts may be more densely clustered) may in fact be at less risk
of disclosure than one of a handful of individuals in a rural Census tract which stretches over
an area of several miles. As this issue is better illustrated in a partially synthetic framework,
we focus here on the partially synthetic (henceforth referred to as simply “synthetic”) case.
That said, this issue still pertains to methods for generating fully synthetic data like Quick
et al. (2014), though the impact is lessened due to the possibility of no synthetic observations
near the locations of the spatial outliers.

We would be remiss not to mention the “robust kriging” literature, a concept proposed by
Hawkins and Cressie (1984). As discussed further by Nirel et al. (1998) and Mugglestone et al.
(2000), the goal of robust kriging is to develop methods of obtaining parameter estimates
which are robust to observations whose responses are outlying (or otherwise not in line with
model assumptions). This is in contrast to our focus here, where we are concerned with
observations whose locations are considered outlying and how this relates to disclosure risk.

The goal of this work is to bring this issue to the forefront and propose a solution. We
begin in Section 2 by illustrating, in detail, the potential risks and how existing approaches
fail to address the root of the problem. In Section 3, we extend existing methods for generating
synthetic data to further reduce disclosure risk for spatially outlying observations using
a concept we refer to as differential smoothing. We implement these methods on simulated 
data in Section 4, highlighting the effectiveness of our approach under various scenarios. We 
then apply the methodology to data consisting of sale prices of homes in San Francisco in 
Section 5. While privacy is not necessarily an issue for these data, they serve as a reasonable 
surrogate for household-level data, where disclosure risks would be of chief concern. Finally, 
in Section 6, we provide concluding remarks and some ideas for future research. 

2 Potential Disclosure Risks in Synthetic Data 

Before discussing the potential risks when generating synthetic data, we must first select
a method for modeling the true data. For the sake of illustration, we shall assume that
the data consist of continuous responses (e.g., annual income) from a single population.
While datasets generally consist of data collected from multiple populations (e.g., race,
socioeconomic status, etc.), we will restrict our attention to the univariate case; the topic of
joint modeling is discussed further in Section 6.

Let \(s_i\) and \(Y(s_i)\) be the location and response variable for the i-th individual, for
\(i = 1, \ldots, N\). For a continuously varying \(Y(s)\) and vector of model parameters, \(\theta\), we may
choose a model of the form

\[ Y(s_i) \mid \theta \sim N\left(x(s_i)'\beta + w(s_i), \tau^2\right), \qquad (1) \]

where \(x(s_i)\) is a vector of spatially varying covariates with a corresponding vector of
regression coefficients, \(\beta\), and \(w(s_i)\) is a random effect that induces correlation between the
responses. To account for spatial correlation in the responses, a highly flexible option is to
assume \(w(s)\) is a mean-zero Gaussian process, \(GP(0, K(\cdot, \cdot; \sigma^2, \phi))\), where
\(K(s_i, s_j; \sigma^2, \phi) = \mathrm{Cov}(w(s_i), w(s_j))\). For a collection of spatial locations,
\(S = \{s_1, \ldots, s_N\}\), we define \(w = \{w(s_1), \ldots, w(s_N)\}'\) and assume
\(w \mid \sigma^2, \phi \sim N(0, \Sigma_W(\sigma^2, \phi))\), where the \((i,j)\)-th element of
\(\Sigma_W(\sigma^2, \phi)\) is \(K(s_i, s_j; \sigma^2, \phi)\). For the sake of brevity, we suppress the conditioning
and simply write \(K(s_i, s_j)\) and \(\Sigma_W\). We define \(K_j\) to be the \((N-1)\)-dimensional vector
with components \(K(s_i, s_j)\) for \(i \ne j\). While there are numerous choices for \(K(\cdot, \cdot)\), we will
illustrate our approach using an exponential covariance structure where
\(\mathrm{Cov}(w(s_i), w(s_j)) = \sigma^2 \exp\{-\phi \|s_i - s_j\|\}\). Here, \(\sigma^2\) represents the variance of the
spatial process and \(\phi\) denotes the spatial range, yielding \(\theta = (\beta, w, \sigma^2, \phi, \tau^2)\) as the
parameters to be estimated. When \(N\) is large, inverting \(\Sigma_W\) can be computationally burdensome,
and a low-rank approximation such as the modified predictive process (Banerjee et al., 2010) may
be required. While the approach we describe in Section 3 can be implemented using a low-rank
approximation, for the sake of illustration, we will assume \(N\) is of a manageable size. This will
allow us to focus on the properties of our approach rather than details of the low-rank approximation.
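
The exponential covariance structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analyses: the function names `exp_cov` and `cov_matrix` and the three toy locations are hypothetical, while the values 4 and 12.7 for the process variance and decay parameter are those used in the simulated example.

```python
import math

def exp_cov(s_i, s_j, sigma2, phi):
    """Exponential covariance: sigma2 * exp(-phi * ||s_i - s_j||)."""
    return sigma2 * math.exp(-phi * math.dist(s_i, s_j))

def cov_matrix(locs, sigma2, phi):
    """Sigma_W, whose (i, j)-th element is K(s_i, s_j)."""
    return [[exp_cov(a, b, sigma2, phi) for b in locs] for a in locs]

# Hypothetical toy locations on the unit square.
locs = [(0.10, 0.20), (0.15, 0.25), (0.51, 0.01)]
Sigma_W = cov_matrix(locs, sigma2=4.0, phi=12.7)

# Distance beyond which the implied correlation drops below 0.05.
d05 = -math.log(0.05) / 12.7
```

For large N one would not build the dense matrix this way; the sketch is meant only to make the roles of the two covariance parameters concrete.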

To illustrate the potential disclosure risk in synthetic data, we generate N = 500 observations
from (1) where \(\tau^2 = 0.0625\), \(\sigma^2 = 4\), \(\phi = 12.7\), and locations on the unit square; the
individuals are shown in Figure 1(a), overlaid on the true response surface. This choice for
\(\phi\) corresponds to \(\mathrm{Cor}(w(s_i), w(s_j)) < 0.05\) for \(\|s_i - s_j\| > \sqrt{2}/6 \approx 0.236\), and the values for
\(\tau^2\) and \(\sigma^2\) were chosen such that the ratio of \(\sigma^2\) to \(\tau^2\) was large; the impact of this ratio can
be seen in Section 3.2. The observation at location (0.51, 0.01) in Figure 1(a) is further than
0.26 units away from the remaining 499 observations, and is henceforth referred to as the
“spatial outlier.” Without loss of generality, we assume this is the N-th observation in the
dataset. Later in Section 3.2, we will identify spatial outliers using a more relaxed definition.
To model these data, we may use an intercept-only model, assume an exponential covariance
structure for the spatial random effects, and take a Bayesian approach, completing
the model specification by defining vague priors for the model parameters. After fitting
the Bayesian hierarchical model and obtaining posterior distributions for the parameters,
we achieve the estimated response surface shown in Figure 1(b). Of particular importance
here is the presence of a ring encircling the spatial outlier, around which the predicted
values appear to gradually decrease from the estimate of \(\beta_0 = 11.44\) outside the ring to
\(\hat{Y}(s_N) = \hat{\beta}_0 + \hat{w}(s_N) = 7.16\), which may be considered too close to the true value of 7.08.

Figure 1: Panels (a) and (b) display the true and (unrestricted) estimated response surfaces
for the data. Locations are denoted by circles for non-at-risk individuals and red triangles for
the at-risk individuals. Panel (c) displays the distribution of L = 500 synthetic individuals
at location (0.51,0.01), generated using the surface in panel (b). Panel titles: (a) True
Response; (b) Estimated Response; (c) Synthetic Responses at (0.51, 0.01).

Given our existing spatial locations, we can generate L = 500 partially synthetic datasets
by sampling synthetic responses, denoted \(Y(s_i)^{\dagger(\ell)}\), from the posterior predictive distribution

\[ Y(s_i)^{\dagger(\ell)} \mid \theta^{(\ell)} \sim N\left(\beta_0^{(\ell)} + w(s_i)^{(\ell)}, \{\tau^2\}^{(\ell)}\right) \]

using the methods described in Quick et al. (2014) for marked point processes, where
\(\beta_0^{(\ell)}\), \(w(s_i)^{(\ell)}\), and \(\{\tau^2\}^{(\ell)}\) denote the \(\ell\)-th approximately independent samples from the
respective posterior distributions for \(\ell = 1, \ldots, L\), and \(i = 1, \ldots, N\). Figure 1(c) displays a histogram
of the 500 synthetic responses for the spatial outlier. Alarmingly, this empirical distribution
is almost perfectly centered around the true value for \(Y(s_N) = 7.08\), denoted by the red line.
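
To make the sampling step concrete, the following Python sketch draws one synthetic response per posterior sample for a single location. It is a minimal sketch under assumptions: the function name `synthesize` is hypothetical, and the "posterior draws" are fabricated stand-ins centered on the estimates quoted above (11.44 for the intercept and 7.16 for the fitted value), not output from an actual model fit.

```python
import random

def synthesize(beta0_draws, w_draws, tau2_draws, seed=0):
    """One synthetic response per posterior draw: N(beta0 + w, tau2)."""
    rng = random.Random(seed)
    return [rng.gauss(b + w, t ** 0.5)
            for b, w, t in zip(beta0_draws, w_draws, tau2_draws)]

# Hypothetical stand-ins for L = 500 posterior draws at the spatial outlier.
L = 500
rng = random.Random(1)
beta0 = [rng.gauss(11.44, 0.15) for _ in range(L)]
w_N = [7.16 - b + rng.gauss(0.0, 0.05) for b in beta0]
tau2 = [0.0625] * L

y_syn = synthesize(beta0, w_N, tau2)
mean_syn = sum(y_syn) / L   # lands close to the fitted value of 7.16
```

Because the spatial effect absorbs nearly all of the outlier's deviation from the intercept, the synthetic responses concentrate near the fitted value, which is the behavior seen in Figure 1(c).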

In essence, fitting a spatial model for data with spatial outliers may lead to overfitting in
the vicinity of the outliers. Furthermore, non-model-based methods for smoothing may also
yield potentially unsatisfactory results. For instance, Zhou et al. (2010) show that replacing
\(Y(s_i)\) with a weighted average \(\tilde{Y}(s_i) = \sum_k W(s_i, s_k) Y(s_k)\), where \(W(s_i, s_k) \ge 0\) is some
spatially-associated weight function with \(\sum_k W(s_i, s_k) = 1\), can produce synthetic data with
decreased risk. Unfortunately, if \(s_N\) is a spatial outlier, this can still result in
\(\tilde{Y}(s_N) \approx Y(s_N)\) when \(W(s_N, s_k) \approx 0\) for \(k \ne N\) for a distance-based choice of \(W(\cdot, \cdot)\). While this
could be avoided by imposing a “disclosure constraint,” this may be detrimental to the remaining
observations. Needless to say, this is a problem that is easy to overlook yet difficult to fully address.
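
The failure mode of distance-based smoothing can be seen in a small Python sketch. We assume a normalized exponential-kernel weight purely for illustration; it is one distance-based choice and not necessarily the weight function studied by Zhou et al. (2010). The function name `smooth`, the bandwidth, and the data values are all hypothetical.

```python
import math

def smooth(locs, y, bandwidth):
    """Replace each y_i with a distance-weighted average of all responses."""
    out = []
    for s_i in locs:
        raw = [math.exp(-math.dist(s_i, s_k) / bandwidth) for s_k in locs]
        total = sum(raw)  # normalise so the weights sum to one
        out.append(sum(r / total * y_k for r, y_k in zip(raw, y)))
    return out

# Three clustered points and one isolated point (hypothetical values).
locs = [(0.10, 0.10), (0.12, 0.10), (0.10, 0.12), (0.90, 0.90)]
y = [11.0, 11.4, 10.8, 7.08]
y_tilde = smooth(locs, y, bandwidth=0.02)
```

The clustered responses are pulled toward one another, but the isolated point receives essentially all of its own weight and is released almost unchanged.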


3 Differential Smoothing Framework 

3.1 Background for Bayesian spatial models 

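Using the model in (1), we can write \(Y \mid \theta \sim N(\mu + w, \Sigma_Y)\) and \(w \mid \sigma^2, \phi \sim N(0, \Sigma_W)\), where
\(Y = \{Y(s_1), \ldots, Y(s_N)\}'\), \(\mu = \{\mu(s_1), \ldots, \mu(s_N)\}'\), \(\mu(s_i) = x(s_i)'\beta\), and \(\Sigma_Y\) is a diagonal
matrix with elements \(\tau^2\). We can then show that the full conditional distribution for \(w\) is

\[ w \mid \cdot \sim N\left(\left[\Sigma_W^{-1} + \Sigma_Y^{-1}\right]^{-1} \Sigma_Y^{-1}(Y - \mu), \left[\Sigma_W^{-1} + \Sigma_Y^{-1}\right]^{-1}\right). \qquad (2) \]

To fit this model under a Bayesian framework, we must specify prior distributions for our
remaining model parameters: \(\beta\), \(\sigma^2\), \(\phi\), and \(\tau^2\).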

Again, we suppose that the N-th observation is a spatial outlier (and thus is determined
to have a high disclosure risk) while the remaining N − 1 observations are clustered together
and treated as having no disclosure risk. In order to account for this in the model, we define
“risk weights” \(a_i \in [0, 1]\) which will be used to differentially smooth the predicted surfaces.
We then define the diagonal matrix \(A\) as having elements \(A_{ii} = 1/\sqrt{1 + \gamma a_i}\), where \(\gamma \ge 0\)
denotes a “global risk” parameter, and define \(w^* = Aw\). Now, if we want to partition the
observations based on risk, we would have

\[ w^* \mid \sigma^2, \phi \sim N\left(0, \begin{pmatrix} A_{(N)} \Sigma_{W,(N)} A_{(N)} & A_{(N)} K_N / \sqrt{1 + \gamma a_N} \\ K_N' A_{(N)} / \sqrt{1 + \gamma a_N} & \sigma^2 / (1 + \gamma a_N) \end{pmatrix}\right), \qquad (3) \]

where \(A_{(N)}\) and \(\Sigma_{W,(N)}\) denote the \((N-1) \times (N-1)\) matrices constructed by removing the
last row and column of \(A\) and \(\Sigma_W\), respectively.
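
The effect of the scaling matrix on the prior covariance can be checked numerically. The sketch below is illustrative only: the function names `risk_scaling` and `scaled_cov`, the 3-by-3 covariance matrix, and the value of the global risk parameter are hypothetical choices made so the arithmetic is easy to follow.

```python
import math

def risk_scaling(a, gamma):
    """Diagonal of A: A_ii = 1 / sqrt(1 + gamma * a_i)."""
    return [1.0 / math.sqrt(1.0 + gamma * a_i) for a_i in a]

def scaled_cov(Sigma_W, a, gamma):
    """Cov(w*) = A Sigma_W A for w* = A w."""
    d = risk_scaling(a, gamma)
    n = len(d)
    return [[d[i] * Sigma_W[i][j] * d[j] for j in range(n)] for i in range(n)]

# Hypothetical 3-point example; the last observation is the at-risk one.
Sigma_W = [[4.0, 2.0, 0.1],
           [2.0, 4.0, 0.2],
           [0.1, 0.2, 4.0]]
a = [0.0, 0.0, 1.0]
cov_star = scaled_cov(Sigma_W, a, gamma=3.0)
```

With a risk weight of one and a global risk parameter of three, the variance of the at-risk element shrinks from 4 to 1 and its cross-covariances are halved, while the block for the remaining observations is untouched.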


3.2 Defining the \(a_i\) and \(\gamma\)

Rather than define \(a_i\) on a continuum, a simple option is to let \(a_i = 1\) if the i-th observation is
deemed a spatial outlier and \(a_i = 0\) otherwise. For instance, we may consider the i-th observation
as an outlier if the distance to the nearest neighbor, \(\min_{j \ne i} \|s_i - s_j\| > M\) for some \(M\).
To define \(M\), we may choose a specification based on an inversion of the correlation structure
used, such as \(M(\phi) = -\log(0.20)/\phi\), which ensures that \(\max_{j \ne i} \mathrm{Cor}(w(s_i), w(s_j)) \ge 0.20\)
for non-outliers. While there is no theoretical basis for this choice, we have found that it
offers a compromise between the utility and the disclosure risk of the synthetic data we
generate. Updating (3) with this restriction yields

\[ w^* \mid \sigma^2, \phi \sim N\left(0, \begin{pmatrix} \Sigma_{W,(N)} & K_N / \sqrt{1 + \gamma} \\ K_N' / \sqrt{1 + \gamma} & \sigma^2 / (1 + \gamma) \end{pmatrix}\right). \qquad (4) \]

We discuss the topic of continuous-valued \(a_i\) later in Section 6.

Choosing a value for \(\gamma\) can be less clear, so first we need to investigate how different
values for \(\gamma\) affect the model. To better elucidate this, suppose \(s_N\) is sufficiently far away
from the other points such that \(\exp(-\phi \|s_N - s_j\|) \approx 0\) for \(j \ne N\); i.e.,

\[ \mathrm{Cov}\left(w^*_{(N)}, w^*(s_N)\right) \approx \begin{pmatrix} \Sigma_{W,(N)} & 0 \\ 0' & \sigma^2 / (1 + \gamma) \end{pmatrix}. \]

Plugging this into the full conditional distribution for \(w^*\) (which takes the form of (2) with
\(\Sigma_W\) replaced by \(A \Sigma_W A\)) yields

\[ E[w^* \mid \cdot] = \begin{pmatrix} \left[\Sigma_{Y,(N)}^{-1} + \Sigma_{W,(N)}^{-1}\right]^{-1} \Sigma_{Y,(N)}^{-1} \left(Y_{(N)} - \mu_{(N)}\right) \\ \dfrac{\sigma^2/(1+\gamma)}{\tau^2 + \sigma^2/(1+\gamma)} \left(Y(s_N) - \mu(s_N)\right) \end{pmatrix} \]

and

\[ V[w^* \mid \cdot] = \begin{pmatrix} \left[\Sigma_{Y,(N)}^{-1} + \Sigma_{W,(N)}^{-1}\right]^{-1} & 0 \\ 0' & \dfrac{\tau^2 \sigma^2/(1+\gamma)}{\tau^2 + \sigma^2/(1+\gamma)} \end{pmatrix}. \]

Note that this implies that the conditional expected value of \(w^*(s_N)\) is a weighted average
of \(Y(s_N) - \mu(s_N)\) and the prior mean of 0; i.e.,

\[ E[w^*(s_N) \mid \cdot] = \frac{\sigma^2/(1+\gamma)}{\tau^2 + \sigma^2/(1+\gamma)} \left(Y(s_N) - \mu(s_N)\right) + \frac{\tau^2}{\tau^2 + \sigma^2/(1+\gamma)} (0) = \alpha \left(Y(s_N) - \mu(s_N)\right) + (1 - \alpha)(0), \qquad (5) \]

where \(\alpha \in [0, \sigma^2/(\sigma^2 + \tau^2)]\) denotes the degree of spatial smoothing. Note that setting
\(\gamma = 0\) yields \(\alpha = \sigma^2/(\sigma^2 + \tau^2)\), which results in the standard unrestricted model. When
choosing a non-zero, finite value for \(\gamma\), one option may be to force \(\alpha\) to take some value in
\((0, \sigma^2/[\sigma^2 + \tau^2])\) to achieve a desired level of differential smoothing. For instance, if
\(\alpha = 1/2\), this corresponds to \(\gamma = \sigma^2/\tau^2 - 1\), provided \(\sigma^2 > \tau^2\). To achieve a “fully smoothed”
process for our outlying observations, however, we let \(\alpha = 0\), which corresponds to \(\gamma = \infty\).
Furthermore, note that this restriction forces \(E[w^*(s_N) \mid \cdot] = V[w^*(s_N) \mid \cdot] = 0\); i.e., if we let
\(\gamma = \infty\), this implies \(w^*(s_N) = 0\) (note that \(w^*(s_N) = 0\) does not imply \(w(s_N) = 0\)).
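
The relationship between the global risk parameter and the degree of smoothing can be worked through numerically. The Python sketch below assumes an isolated outlier with a risk weight of one, as in the derivation above; the function names `alpha` and `gamma_for_alpha` are hypothetical, and the variance values are those of the simulated example.

```python
def alpha(gamma, sigma2, tau2):
    """Weight placed on Y(s_N) - mu(s_N) for an isolated outlier with a_N = 1."""
    v = sigma2 / (1.0 + gamma)
    return v / (tau2 + v)

def gamma_for_alpha(target, sigma2, tau2):
    """Invert alpha(gamma); requires 0 < target <= sigma2 / (sigma2 + tau2)."""
    upper = sigma2 / (sigma2 + tau2)
    if not 0.0 < target <= upper:
        raise ValueError("target alpha outside (0, sigma2 / (sigma2 + tau2)]")
    return sigma2 * (1.0 - target) / (target * tau2) - 1.0

sigma2, tau2 = 4.0, 0.0625                    # values from the simulated example
a0 = alpha(0.0, sigma2, tau2)                 # unrestricted model, about 0.985
g_half = gamma_for_alpha(0.5, sigma2, tau2)   # equals sigma2 / tau2 - 1 = 63
```

With these values, the unrestricted model places roughly 98.5% of the weight on the outlier's own residual, which is why the synthetic responses track the true value so closely; halving that influence already requires a global risk parameter of 63.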

3.3 Implementation 

To implement our differential smoothing approach, we first fit the unrestricted model:

\[ \pi(\beta, w, \sigma^2, \phi, \tau^2 \mid Y) \propto N(Y \mid \mu + w, \Sigma_Y) \times N(w \mid 0, \Sigma_W) \times \pi(\beta, \sigma^2, \phi, \tau^2), \qquad (6) \]

with \(a_i = 0\) for all \(i\) (or \(\gamma = 0\)) and using vague prior specifications for \(\beta\), \(\sigma^2\), \(\phi\), and \(\tau^2\), where
\(\pi(x \mid y)\) denotes the conditional distribution of \(x\) given \(y\). We could then specify \(a_i\) and \(\gamma\) to
remain functions of our model parameters (i.e., \(a_i(\phi)\) and \(\gamma(\sigma^2, \tau^2)\)), changing the degree of
smoothing adaptively. As we will discuss in Section 6, however, this may have consequences
regarding parameter estimation (e.g., the loss of conjugacy for \(\sigma^2\)), and thus we do not pursue
this here. Instead, we specify the \(a_i\) using the distance to the nearest neighbor (as a function
of the posterior median of \(\phi\) from our unrestricted model) and implement a fully smoothed
restriction by setting \(\gamma = \infty\). We then fit the restricted hierarchical model

\[ \pi(\beta, w, \sigma^2, \phi, \tau^2 \mid Y) \propto N(Y \mid \mu + Aw, \Sigma_Y) \times N(w \mid 0, \Sigma_W) \times \pi(\beta, \sigma^2, \phi, \tau^2), \qquad (7) \]

using these values of \(a_i\) and \(\gamma\). To facilitate faster convergence, we can use samples from the
unrestricted model as initial values for the restricted model, and we recommend fixing \(\phi\) so
as not to affect which observations are to be deemed “spatial outliers”.

Using the samples drawn from the posterior distribution of the restricted model,
we then generate synthetic data from the posterior predictive distribution of
\(Y(s_i)^{\dagger(\ell)} \mid \theta^{(\ell)}\). Here again, note that if we use the fully smoothed approach where
\(\gamma = \infty\), the posterior samples of \(w(s_N)\) are simply draws from the conditional prior
distribution, \(w(s_N) \mid w_{(N)}\).
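
The step of flagging spatial outliers from nearest-neighbor distances can be sketched as follows. This is an illustration under assumptions rather than the implementation used for the analyses: the function names `nn_distances` and `risk_weights` and the four toy locations are hypothetical, and the decay parameter is fixed at 12.7 in place of a posterior median from an unrestricted fit.

```python
import math

def nn_distances(locs):
    """Distance from each location to its nearest neighbour."""
    return [min(math.dist(s_i, s_j) for j, s_j in enumerate(locs) if j != i)
            for i, s_i in enumerate(locs)]

def risk_weights(locs, phi, min_cor=0.20):
    """Binary a_i: 1 if the nearest neighbour is further than M(phi)."""
    M = -math.log(min_cor) / phi
    return [1 if d > M else 0 for d in nn_distances(locs)], M

# Hypothetical locations; the last mimics the isolated point at (0.51, 0.01).
locs = [(0.10, 0.50), (0.15, 0.55), (0.20, 0.50), (0.51, 0.01)]
a, M = risk_weights(locs, phi=12.7)
```

Holding the decay parameter fixed when computing the threshold mirrors the recommendation above: the set of flagged observations then cannot change from one MCMC iteration to the next.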

4 Simulated Example 

Before delving into an assessment of the proposed method, we will first describe the motivation
for the simulated example used both here and in Section 2. The response is intended to
correspond to an individual’s log-transformed income, centered around an annual income of
roughly $50,000 with a small proportion of the sample having incomes higher than $1,000,000
and some individuals having incomes below the poverty line. The observations are sampled
such that the majority of the data come from a high density region of the spatial domain,
while a few of the individuals reside in less densely populated regions (with respect to the
subpopulation being sampled). As is common with real data, the simulated data contain
pockets of both high and low income individuals (in practice, agencies tend to release
top-coded income data (e.g., see Crimi and Eddy, 2014), another data-privacy method which can
result in bias). To achieve this in these data, we generated from the model where \(\tau^2 = 0.0625\),
\(w \mid \sigma^2, \phi \sim N(0, \Sigma_W)\) with \(\sigma^2 = 4\) and \(\phi = 12.7\), and

\[ Y(s_i) \mid w(s_i), \tau^2 \sim N\left(11 + 0.25 \times \|s_{i1} - 0.25\| + 0.25 \times \|s_{i2} - 0.5\| + w(s_i), \tau^2\right). \qquad (8) \]

As displayed in Figure 1(a), we observe a spatially outlying individual at (0.51,0.01) in a
relatively low income bracket whom we have identified as being at-risk for disclosure. Using
the methods described in Section 3, we will demonstrate our differential smoothing approach
for protecting this and other individuals. We will also compare these results to those from
an analysis where the spatial outlier was removed from the data prior to model fitting.
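
For readers who wish to generate data of this kind, the Python sketch below simulates from the model in (8) using a hand-written Cholesky factorization. Several things are assumptions: the intercept is taken to be 11, the locations are drawn uniformly on the unit square rather than from the clustered design described above, only 50 points are used to keep the example fast, and the function names `cholesky` and `simulate` are hypothetical.

```python
import math
import random

def cholesky(A):
    """Lower-triangular L with L L' = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def simulate(locs, sigma2=4.0, phi=12.7, tau2=0.0625, seed=0):
    """Draw w ~ N(0, Sigma_W), then Y(s_i) around the mean in (8)."""
    rng = random.Random(seed)
    n = len(locs)
    Sigma = [[sigma2 * math.exp(-phi * math.dist(a, b)) for b in locs]
             for a in locs]
    Lc = cholesky(Sigma)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w = [sum(Lc[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    y = [11.0 + 0.25 * abs(s[0] - 0.25) + 0.25 * abs(s[1] - 0.5) + w_i
         + rng.gauss(0.0, math.sqrt(tau2)) for s, w_i in zip(locs, w)]
    return y, w

rng = random.Random(42)
locs = [(rng.random(), rng.random()) for _ in range(50)]  # hypothetical design
y, w = simulate(locs)
```

A dense Cholesky factorization scales cubically in the number of locations, which is the computational burden that motivates low-rank approximations for large N.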

After fitting the unrestricted hierarchical model in (6), we consider the restricted model
of Section 3, where we let \(a_i\) be a 0/1 indicator function for the absence of neighbors within
\(M = -\log(0.20)/\phi = 0.13\) units, resulting in 2 additional at-risk individuals (denoted using
red triangles in Figure 1). We then let \(\gamma = \infty\), forcing \(w^*(s_i) = 0\) for the at-risk observations,
while leaving the non-at-risk observations relatively unaffected. Refitting the model under
this specification, we obtain the estimated response surface in Figure 2(b). Comparing this
figure to the unrestricted surface in Figure 2(a), a number of features are noticeable. First,
as shown in Table 1, the estimate of \(\beta_0\) has decreased from 11.35 to 10.31, largely due to
the negative pull of the outlying observation at (0.51,0.01), resulting in lower predictions
for the unobserved regions on the right side of the spatial domain. Secondly, the ring of
low predicted values around the spatial outlier has vanished, resulting in a surface that is
essentially naive to the existence of this individual. Additionally, note that the estimate of
\(\sigma^2\) in our restricted model is similar to that from the analysis of the suppressed dataset,
while the estimates of \(\beta_0\) and \(\tau^2\) differ substantially. This is because we cannot learn about
\(w(s_N)\) in either model, leaving \(\beta_0\) and \(\epsilon(s_N)\) to do more work in the restricted model.

Figure 2: Estimated response surfaces from the unrestricted and restricted models using the
simulated data.

Model               \(\beta_0\)              \(\sigma^2\)          \(\tau^2\)
Full Unrestricted   11.35 (11.13, 11.71)     3.83 (3.22, 4.59)     0.06 (0.05, 0.08)
Restricted          10.31 (10.20, 10.77)     3.77 (3.07, 4.53)     0.12 (0.10, 0.15)
Suppressed          11.71 (11.49, 12.07)     3.81 (3.19, 4.58)     0.06 (0.05, 0.08)

Table 1: Parameter estimates from each of our hierarchical models. Note the effect of the
spatial outlier whose value (7.06) is much less than the mean of the data (10.76).

We now turn our attention to the synthetic data generated from these models. Figure 3
displays the distributions of the synthetic responses for the spatial outlier. In each panel,
the true value for this individual is denoted by the red vertical line, while the histogram
for the restricted model also contains a green line denoting the mean of the unrestricted
synthetic responses (for comparison purposes) and a blue line denoting the mean for the
set of restricted responses. Here, we see the impact of the smoothing techniques in the
restricted model, as now the synthetic responses are centered around the estimate for \(\beta_0\)
from Table 1 instead of the true value of 7.08. Recalling that these responses are modeled
after log-transformed annual incomes, we can assess the disclosure risk for this individual by
computing the proportion of synthetic incomes within a certain \(\epsilon\) of the truth (see, e.g., Quick
et al., 2014). For instance, 100% of the synthetic incomes from the unrestricted model are
within $10,000 of the true value, compared to only 30% for our restricted model. Similarly,
the proportion of synthetic incomes within 10% of their true values for our three at-risk
individuals has been reduced by at least 73% and by an average of 20% for the non-at-risk
individuals. To assess the utility of the synthetic data from our models, we fit

\[ Y(s_i)^{\dagger(\ell)} = \beta_0^{\dagger(\ell)} + \beta_1^{\dagger(\ell)} \|s_{i1} - 0.25\| + \beta_2^{\dagger(\ell)} \|s_{i2} - 0.5\| + \epsilon(s_i) \]

for \(\ell = 1, \ldots, L\) and used the combination rules in Reiter (2003) to obtain point and interval
estimates for our regression parameters from each model. Table 2 displays these results for
our unrestricted and restricted models, as well as those corresponding to the analysis of the
suppressed data. In general, our regression parameters, \(\beta^{\dagger}\), are relatively unaffected, though
this is not surprising given the small number of at-risk observations.

Figure 3: Distributions of the synthetic responses for the spatial outlier from the unrestricted
and restricted models using the simulated data. Panels: (a) Unrestricted; (b) Restricted.

Parameter                             Full Unrestricted       Restricted              Suppressed Unrestricted
\(\beta_0^{\dagger}\) (Intercept)       10.58 (10.31, 10.85)    10.57 (10.30, 10.84)    10.54 (10.27, 10.80)
\(\beta_1^{\dagger}\) (\(s_{i1}\) slope)  -0.16 (-0.33, 0.02)     -0.16 (-0.34, 0.02)     -0.14 (-0.32, 0.04)
\(\beta_2^{\dagger}\) (\(s_{i2}\) slope)  0.37 (0.20, 0.55)       0.40 (0.22, 0.57)       0.42 (0.24, 0.60)
\(Y^{\dagger}(s_N)\)                    7.18 (6.57, 7.80)       10.33 (6.21, 13.91)     11.86 (8.26, 15.63)

Table 2: Parameter estimates from the simulated example. Note: the estimates for \(\beta^{\dagger}\) from
the unrestricted model mirror those from a fit of the real data, thus these results have not
been shown for the sake of brevity.
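
The disclosure-risk measure used in this section, the proportion of synthetic incomes falling within a fixed dollar amount of the true income, can be sketched in Python as follows. The function name `prop_within` and the two short lists of synthetic log-incomes are hypothetical and chosen only to make the calculation easy to follow; they are not output from the models above.

```python
import math

def prop_within(y_syn_log, y_true_log, eps_dollars):
    """Share of synthetic incomes within eps_dollars of the true income."""
    truth = math.exp(y_true_log)
    hits = sum(abs(math.exp(y) - truth) <= eps_dollars for y in y_syn_log)
    return hits / len(y_syn_log)

# Hypothetical synthetic log-incomes for an individual whose true value is 7.08.
close_draws = [7.00, 7.10, 7.20, 7.05]      # e.g. from an overfit surface
smooth_draws = [9.50, 10.30, 7.10, 11.00]   # e.g. after differential smoothing
risk_close = prop_within(close_draws, 7.08, 10_000)
risk_smooth = prop_within(smooth_draws, 7.08, 10_000)
```

Note that the comparison is made on the income scale, so the synthetic and true log-responses are exponentiated before the distance is computed.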

5 Real Data Example 


Having illustrated the potential risks of the common, unrestricted model and demonstrated
the effectiveness of our differential smoothing approach, we now apply our methodology
to a dataset of home sale prices in San Francisco for the period from Feb. 2008 to
July 2009. These data were collected and described by Adler (2010) and consist of the sale
price, the square footage, the number of bedrooms, and the spatial location (latitude and
longitude) for each home. For the purposes of this paper, we will restrict our attention to the
214 homes with one bedroom. While these data themselves are not considered “at-risk” for
disclosure (e.g., home listings are publicly available), the number of bedrooms and the home
value may reasonably be considered as surrogates for sensitive household information such
as the size of a household and the total household income, respectively. Thus, we believe the
dependencies underlying these data are representative of those underlying data for which
disclosure risk would be of concern.

Figure 4: Estimated response surfaces from the unrestricted and restricted models using the
San Francisco home sales data. Panels: (a) Unrestricted; (b) Restricted.

Following the process used in Section 4, we first model the log-transformed sale prices using
the unrestricted hierarchical model in (6) with the square footage as a covariate, yielding
the prediction surface for \(w(\cdot)\) in Figure 4(a). Here again, we see “rings” in the prediction
surface surrounding a number of potential spatial outliers (denoted by red triangles). Based
on the results presented in Section 4, one can intuit that synthetic responses generated from
this prediction surface for these outliers may be unacceptably close to their true values, thus
motivating the use of differential smoothing. Fortunately, the ratio of \(\sigma^2\) (\(\approx 0.13\)) to
\(\tau^2\) (\(\approx 0.043\)) is not as dramatic as in our simulated example, so the synthetic responses for the
outlying observations are slightly shifted away from \(Y(s_i) = 13.30\) toward \(x(s_i)'\beta = 14.15\),
as shown in Figure 5(a) for the observation at (−122.48,37.76).


We then proceed to fit the restricted model. Based on the distances to their nearest
neighbors, we identify seven homes as spatial outliers. Using this approach, we obtain the
predicted surface for \(w(\cdot)\) in Figure 4(b) and the synthetic data in Figure 5(b). As in the
simulated example, this approach yields synthetic responses centered around the estimated
value of \(x(s_i)'\beta = 14.04\) in the restricted model. To quantify this in terms of the risk
of disclosure, the percentage of synthetic responses for the observation at (−122.48,37.76)
which are within 10% of the true value has been reduced by 93%, dropping from 46.2% of
our synthetic responses in the unrestricted model to just 3% in our restricted model. Overall,
this risk was reduced 50% for at-risk individuals and 11% for non-at-risk individuals.

Figure 5: Distributions of the log-transformed synthetic sale prices for the home at
(−122.48,37.76) from the unrestricted and restricted models using the San Francisco home
sales data. Panels: (a) Unrestricted; (b) Restricted.

Now, in order for our restricted model to be a valuable tool, it is important to demonstrate
that it can provide synthetic data which yield statistical inference similar to that from the
real data. To evaluate the utility of our synthetic data, we fit a linear regression of the
synthetic log-transformed sale prices on square footage, with intercept \(\beta_0\) and slope \(\beta_1\),
for \(\ell = 1, \ldots, L\) for each set of synthetic responses and again used the combination rules in
Reiter (2003) to obtain point and interval estimates for our regression parameters. Here, our
results are even more impressive than in Table 2, as our restricted synthetic data produce
estimates \(\beta_0 = 13.233\) (13.197, 13.269) and \(\beta_1 = 0.269\) (0.233, 0.306), which are
each within 0.002 of those from the real data. To put these results in context, consider that
the estimate for \(\beta_1\) obtained from synthetic data generated from a model using a suppressed
dataset is 0.279 (0.244, 0.314).
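
The combination step can be sketched in Python as follows. We assume the form usually cited for partially synthetic data, in which the point estimate is the average of the per-dataset estimates and the total variance is the average within-dataset variance plus the between-dataset variance divided by the number of datasets; readers should consult Reiter (2003) for the exact rules and the associated reference distribution. The function name `combine_partially_synthetic` and the four example estimates are hypothetical.

```python
from statistics import mean, variance

def combine_partially_synthetic(estimates, variances):
    """Point estimate and total variance across m synthetic datasets.

    q_bar = mean of the estimates, b = their sample variance,
    u_bar = mean of the within-dataset variances, T = u_bar + b / m.
    """
    m = len(estimates)
    q_bar = mean(estimates)
    b = variance(estimates)      # between-dataset variance
    u_bar = mean(variances)      # average within-dataset variance
    return q_bar, u_bar + b / m

# Hypothetical slope estimates and squared standard errors from m = 4 fits.
q = [0.27, 0.26, 0.28, 0.27]
u = [0.0004, 0.0004, 0.0004, 0.0004]
q_bar, T = combine_partially_synthetic(q, u)
```

In practice the number of synthetic datasets would be far larger than four (L = 500 in the examples here), which makes the between-dataset contribution to the total variance small.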

6 Discussion 

In this paper, we have shed light on a unique issue regarding disclosure risk encountered
when generating spatially-referenced synthetic microdata from a population with spatially
outlying observations. After first illustrating an example of when this risk can arise in
Section 2, we proposed a framework which could be used to alleviate the risk of disclosure
by restricting the hierarchical model using differential smoothing. We then demonstrated its
use on simulated data and applied it to a dataset of home sale prices in San Francisco.

Along with producing data which limit the risk of disclosure, producing data with high
utility is of the utmost importance. While the synthetic data that we have generated in
Sections 4 and 5 have been able to provide inference which was on par with that from the
real data, this is a much more nuanced problem in practice. For instance, suppose our data
consist of the gross annual household incomes for households in a particular region (and for
the sake of illustration, suppose these data are not top-coded). If many of our spatial outliers
also happen to be high earners (say, household incomes greater than $250,000 per year),
our synthetic data will likely underestimate the number of high earners in the population.
Fortunately, such issues can be addressed by constructing our hierarchical models based on
important questions of inferential interest. If we desire synthetic data which preserve the
number of households in certain income brackets, we can specify conditional models such as

\[ Y^{\dagger}(s_i) \mid Y(s_i) \in G_k, \beta, w, \tau^2 \sim N\left(x(s_i)'\beta + w(s_i), \tau^2\right) \times I\left\{Y^{\dagger}(s_i) \in G_k\right\}, \qquad (9) \]

where \(I\{Y^{\dagger}(s_i) \in G_k\}\) is an indicator function ensuring that \(Y^{\dagger}(s_i)\) belongs to a particular
group, denoted \(G_k\). That is, we could model each household’s income using a truncated normal
distribution, generating synthetic households that belong to the correct income brackets
and preserving the proportions observed in the real population. While such a model would
reduce data privacy (i.e., we must be willing to disclose a household’s true income bracket),
data stewards know this risk beforehand and can take appropriate measures.
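
Drawing from a normal distribution restricted to an income bracket can be sketched with the Python standard library using inverse-CDF sampling. This is a minimal sketch under assumptions: the function name `draw_truncated_normal`, the bracket endpoints, and the mean and standard deviation are hypothetical, and the approach presumes the bracket carries non-negligible probability under the untruncated distribution.

```python
import math
import random
from statistics import NormalDist

def draw_truncated_normal(mu, sd, lower, upper, rng):
    """Inverse-CDF draw from N(mu, sd^2) restricted to [lower, upper]."""
    dist = NormalDist(mu, sd)
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_lo, p_hi)
    # Clamp to guard against floating-point round-off at the bracket edges.
    return min(max(dist.inv_cdf(u), lower), upper)

rng = random.Random(0)
# Hypothetical bracket on the log-income scale: $50,000 to $100,000.
lo, hi = math.log(50_000), math.log(100_000)
draws = [draw_truncated_normal(11.0, 0.25, lo, hi, rng) for _ in range(200)]
```

Every synthetic value falls inside the stated bracket by construction, which is exactly the information a data steward would be agreeing to disclose under such a model.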

While our work here was focused on scenarios with a single population and Gaussian outcomes,
the framework we have presented can be extended to a multivariate framework
and/or for use in generalized linear mixed models. For instance, the value of a residence in
San Francisco is likely a function of the location (\(s_i\)), number of bedrooms (\(k\)), the square
footage (\(\mathrm{SqFt}_{1k}\)), and the age of the property (in years; \(\mathrm{Age}_{2k}\)). To model the age of the
property using differential smoothing, we could let

\[ \mathrm{Age}_{2k}(s_i) \mid \gamma_0, w_{\mathrm{age}} \sim \mathrm{Pois}\left(\exp\left[\gamma_0 + w_{\mathrm{age}}(s_i)\right]\right), \qquad (10) \]

where \(\gamma_0\) is an intercept term and \(w_{\mathrm{age}}(s)\) is a differentially smoothed spatial process. Then,
to model the property’s value, we could let

\[ Y_k(s_i) \mid \theta_k \sim N\left(\beta_{0k} + \mathrm{SqFt}_{1k}(s_i)\beta_{1k} + \mathrm{Age}_{2k}(s_i)\beta_{2k} + w_k(s_i), \tau_k^2\right), \quad k = 0, \ldots, K, \qquad (11) \]

where \(\theta_k\) collects the model parameters for homes with \(k\) bedrooms and
\(w(s) = \{w_0(s), \ldots, w_K(s)\}'\) is a differentially smoothed multivariate spatial process. In this
model, predictions for an outlying observation at location \(s_i\)
with \(k\) bedrooms would be based on its group-specific regression model, as well as a function
of the observations near \(s_i\) with a different number of bedrooms. For instance, in a region
composed primarily of small condominiums, the spatial surfaces for studio (no bedroom)
and one-bedroom units could help inform the surface for rarer two-bedroom units.

We conclude by acknowledging that this work is just a first step toward achieving reduced
disclosure risk. One drawback of the restricted model used here is that it treats the idea of
being a spatial outlier as a binary decision. In our future work, we aim to devise an approach
which defines \(a_i\) continuously over the range [0,1]. One option would be to define

\[ a_i(\phi) = 1 - \exp\left(-\phi \min_{j \ne i} \|s_i - s_j\|\right) \]

and \(\gamma(\sigma^2, \tau^2) = \sigma^2/\tau^2 - 1\) as explicit functions of the parameters \(\phi\), \(\sigma^2\), and \(\tau^2\), and account
for these definitions in our MCMC sampler. While this is conceptually straightforward, it is
unclear how such a framework would affect the convergence of our model parameters, much
less whether these particular definitions are optimal.
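
The continuous definitions suggested above are simple to evaluate, as the following Python sketch shows. The function names `continuous_risk_weight` and `gamma_from_variances` are hypothetical, and the parameter values are borrowed from the simulated example purely for illustration; in a full implementation these quantities would be recomputed at each iteration of the sampler.

```python
import math

def continuous_risk_weight(nn_dist, phi):
    """a_i(phi) = 1 - exp(-phi * nearest-neighbour distance), in [0, 1)."""
    return 1.0 - math.exp(-phi * nn_dist)

def gamma_from_variances(sigma2, tau2):
    """gamma(sigma2, tau2) = sigma2 / tau2 - 1 (non-negative when sigma2 >= tau2)."""
    return sigma2 / tau2 - 1.0

a_near = continuous_risk_weight(0.01, phi=12.7)   # a close neighbour
a_far = continuous_risk_weight(0.26, phi=12.7)    # an isolated point
g = gamma_from_variances(4.0, 0.0625)
```

An observation with a neighbor 0.01 units away receives a weight near 0.12, while one whose nearest neighbor is 0.26 units away receives a weight above 0.96, so the degree of smoothing grows smoothly with isolation rather than switching on at a threshold.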
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